Acknowledgement


This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.

No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.



Layout of the Report

This report is divided into four sections as listed below:

Section 1.1 - Food Standards Agency Interventions - Technical Development

Section 1.2 - Food Standards Agency Interventions - Professional Report

Section 2.1 - Book Sales - Technical Development

Section 2.2 - Book Sales - Professional Report


Section 1.1 - Food Standards Agency Interventions - Technical Development

This analysis of the Food Standards Agency data addresses the following requests from the agency's managers:

1.1.1 Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels combined.

1.1.2 Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels separately - A to E.

1.1.3 Determine if employing more employees increases the likelihood of establishments successfully responding to enforcement actions.

1.1.4 Determine if there is a relationship between proportion of successful responses and the number of FTE food safety employees in each local authority.

1.1.5 Determine if there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority.

Data Dictionary for Food Data:

The data dictionary below is based on the support sources provided for Question 1. Note that only the features relevant to this analysis are described:

Feature Description
Country Country of Local Authority
LAType Local Authority Type
LAName Local Authority Name
Total_Establishments Total number of establishments in the area
Percent_Compliant_A_E Total % of Establishments that are broadly compliant with the regulations and are rated from category A to E
A_Rated Number of Establishments that are rated A in the area
Percent_Compliant_A Total % of establishments that are broadly compliant and rated A
B_Rated Number of establishments that are rated B in the area
Percent_Compliant_B Total % of establishments that are broadly compliant and rated B
C_Rated Number of establishments that are rated C in the area
Percent_Compliant_C Total % of establishments that are broadly compliant and rated C
D_Rated Number of establishments that are rated D in the area
Percent_Compliant_D Total % of establishments that are broadly compliant and rated D
E_Rated Number of establishments that are rated E in the area
Percent_Compliant_E Total % of establishments that are broadly compliant and rated E
Interventions_A_E Total % of Interventions that have taken place for all impact levels A to E
Interventions_A Total % of Interventions that have taken place for premises rated A
Interventions_B Total % of Interventions that have taken place for premises rated B
Interventions_C Total % of Interventions that have taken place for premises rated C
Interventions_D Total % of Interventions that have taken place for premises rated D
Interventions_E Total % of Interventions that have taken place for premises rated E
Employees Number of Professional Full Time Posts that are currently occupied in the Local Authority Office

Reading the food data into the R environment:

food_data <-  read_csv("2019-20-enforcement-data-food-hygiene.csv", guess_max = 500, show_col_types = FALSE)

# Note - The imported dataset has 353 observations, matching with the provided input .csv file
# Basic data checks on the imported data set

# summary check

summary(food_data)

# structure check

str(food_data)

Observations from the above data exploration inform the cleaning steps below.

Data Cleaning Steps:

  1. Column names in the imported data set are not intuitive, so they are renamed. Definitions and the updated column names are given in the data dictionary above.
  2. There are missing values in the imported data set; these rows need to be omitted.
  3. Columns such as ‘Interventions_A’ need their data type corrected from character to numeric.

Note: The imported data set has extra columns which are not needed to perform the required analysis.
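If desired, those unused columns can be dropped once the renaming in Data Cleaning Step 1 below has been applied. A minimal sketch using dplyr::select (the name food_data_subset is hypothetical; this step is optional, since the analysis below also works on the full data frame):

```r
# Optional: keep only the columns described in the data dictionary above
library(dplyr)

food_data_subset <- food_data %>%
  select(Country, LAType, LAName, Total_Establishments,
         starts_with("Percent_Compliant"), ends_with("_Rated"),
         starts_with("Interventions"), Employees)
```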

Data Cleaning Step 1: Renaming the columns so they can be handled efficiently in the R environment:

# Utilising the 'colnames' function to rename the Column names 

colnames(food_data) <-c(
            "Country","LAType","LAName","Total_Establishments","NR_For_Intervention",
            "Outside_Programme","Percent_Compliant_A_E","Percent_Compliant_Inc_NR",
            "A_Rated","Percent_Compliant_A","B_Rated","Percent_Compliant_B","C_Rated",
            "Percent_Compliant_C","D_Rated","Percent_Compliant_D","E_Rated","Percent_Compliant_E",
            "Interventions_A_E","Interventions_A","Interventions_B","Interventions_C","Interventions_D",
            "Interventions_E","Interventions_NR","Volun_Closure","Seizure",
            "Suspension_Revocation_Licence","Hygiene_Emergency_Pro_Notice",
            "Pro_Order","Caution","Hygiene_Notices","Detention_Notices",
            "Written_Warning","Prosecution_Concluded","Employees")

Data Cleaning Step 2: Identifying and removing rows with missing values, using ‘Interventions_A’ to locate the NAs.

food_data <- food_data %>%
  filter(!is.na(Interventions_A))

# After removing missing values, we've 347 rows (=353-6)
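A defensive check can confirm the expected row count before continuing; a small sketch, with the expected value taken from the comment above:

```r
# Fail fast if the NA filter did not leave the expected 347 rows
stopifnot(nrow(food_data) == 353 - 6)
```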

Data Cleaning Step 3: Correcting the data types of Percent_Compliant_A and Interventions_A

# Char data type in 'Percent_Compliant_A' because there are 52 entries below with "NP"

food_data %>%
  dplyr::group_by(Percent_Compliant_A) %>%
  dplyr::summarise(n = n()) %>%
  arrange(desc(n))

# Char data type in 'Interventions_A' because there are 24 entries below with "NR"

food_data %>%
  dplyr::group_by(Interventions_A) %>%
  dplyr::summarise(n = n()) %>%
  arrange(desc(n))

# Updating "NP" and "NR" to 0 using dplyr mutate function:

food_data <- food_data %>%
  mutate(Percent_Compliant_A = if_else(Percent_Compliant_A == 'NP','0',Percent_Compliant_A),
         Interventions_A = if_else(Interventions_A == 'NR','0',Interventions_A))


# Updating the data type of Percent_Compliant_A, Interventions_A, and LAType

food_data <- food_data %>%
  mutate(Percent_Compliant_A = as.numeric(Percent_Compliant_A),
         Interventions_A = as.numeric(Interventions_A),
         LAType = as.factor(LAType))
# Checking summary and str to see if all the updates are made correctly:

summary(food_data)

str(food_data)

1.1.1 Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels combined.

ggplot(food_data, aes(x = Interventions_A_E))+
  geom_histogram(aes(y = ..density..), binwidth = 1, position = "identity" ,alpha = 0.8) + 
  stat_function(fun = dnorm, args = list(mean = mean(food_data$Interventions_A_E), sd = sd(food_data$Interventions_A_E)), color = "blue") +
  labs(x= 'Interventions A-E',y='Density', title = 'Distribution of Enforcement Actions Successfully achieved')

1.1.2 - Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels separately - A to E.

grid.arrange(

  ggplot(food_data, aes(x = Interventions_A)) +
    geom_histogram(aes(y = ..density..), binwidth = 1, position = "identity", alpha = 0.8) +
    stat_function(fun = dnorm, args = list(mean = mean(food_data$Interventions_A), sd = sd(food_data$Interventions_A)), color = "blue") +
    labs(x = 'Intervention A', y = 'Density'),

  ggplot(food_data, aes(x = Interventions_B)) +
    geom_histogram(aes(y = ..density..), binwidth = 1, position = "identity", alpha = 0.8) +
    stat_function(fun = dnorm, args = list(mean = mean(food_data$Interventions_B), sd = sd(food_data$Interventions_B)), color = "blue") +
    labs(x = 'Intervention B', y = 'Density'),

  ggplot(food_data, aes(x = Interventions_C)) +
    geom_histogram(aes(y = ..density..), binwidth = 1, position = "identity", alpha = 0.8) +
    stat_function(fun = dnorm, args = list(mean = mean(food_data$Interventions_C), sd = sd(food_data$Interventions_C)), color = "blue") +
    labs(x = 'Intervention C', y = 'Density'),

  ggplot(food_data, aes(x = Interventions_D)) +
    geom_histogram(aes(y = ..density..), binwidth = 1, position = "identity", alpha = 0.8) +
    stat_function(fun = dnorm, args = list(mean = mean(food_data$Interventions_D), sd = sd(food_data$Interventions_D)), color = "blue") +
    labs(x = 'Intervention D', y = 'Density'),

  ggplot(food_data, aes(x = Interventions_E)) +
    geom_histogram(aes(y = ..density..), binwidth = 1, position = "identity", alpha = 0.8) +
    stat_function(fun = dnorm, args = list(mean = mean(food_data$Interventions_E), sd = sd(food_data$Interventions_E)), color = "blue") +
    labs(x = 'Intervention E', y = 'Density'),

  ncol = 2, top = textGrob("Percent of enforcement actions successfully achieved - for all impact levels separately - A to E")
)

1.1.3 - Determine if employing more employees increases the likelihood of establishments successfully responding to enforcement actions.

Visually checking the correlation between Employees and Interventions A-E:

emp_est_plot <- ggplot(food_data, aes(y = Interventions_A_E, x = Employees)) +
  geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Employees", y = "Interventions A-E",
       title = "Variation in Interventions A-E as a function of Number of Employees",
       subtitle = "The shaded area shows the 95% CI for the best-fitting regression line (r = -0.02)")

emp_est_plot
## `geom_smooth()` using formula 'y ~ x'

As seen from the figure above, there is only a weak correlation between Employees and Interventions A-E. This is confirmed by the correlation coefficients below:

# Checking the correlation between Employees and Interventions_A_E

rcorr(
  food_data$Employees,
  food_data$Interventions_A_E
  )
##       x     y
## x  1.00 -0.02
## y -0.02  1.00
## 
## n= 347 
## 
## 
## P
##   x      y     
## x        0.6552
## y 0.6552

Fitting a linear model to determine the relationship between Interventions (A-E) and Employees:

lm_intervention_employees <- lm(Interventions_A_E ~ Employees, food_data)

summary(lm_intervention_employees)
## 
## Call:
## lm(formula = Interventions_A_E ~ Employees, data = food_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.304  -4.575   4.067   8.658  13.860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  87.1091     1.2828  67.905   <2e-16 ***
## Employees    -0.1195     0.2675  -0.447    0.655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.4 on 345 degrees of freedom
## Multiple R-squared:  0.0005787,  Adjusted R-squared:  -0.002318 
## F-statistic: 0.1998 on 1 and 345 DF,  p-value: 0.6552
# Confirming the above results via anova()

anova(lm_intervention_employees)
## Analysis of Variance Table
## 
## Response: Interventions_A_E
##            Df Sum Sq Mean Sq F value Pr(>F)
## Employees   1     31  30.703  0.1998 0.6552
## Residuals 345  53021 153.683

As evident from the linear model summary and ANOVA test above, there is no evidence that employing more employees increases the likelihood of establishments successfully responding to enforcement actions (r = -0.02; t(345) = -0.447, p = 0.655).

1.1.4 - Determine if there is a relationship between proportion of successful responses and the number of FTE food safety employees in each local authority.

# Checking the correlation:

emp_est_LAType_plot <- ggplot(food_data, aes(y = Interventions_A_E, x = Employees, colour = LAType)) +
  geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Employees", y = "Interventions A-E",
       title = "Variation in Interventions A-E as a function of Employees for Each LA Type",
       subtitle = "The shaded area shows the 95% CI for the best-fitting regression line")

emp_est_LAType_plot
## `geom_smooth()` using formula 'y ~ x'

# Utilising lm model to determine the relationship between interventions_A_E and Employees for each LAType

lm_employees_intervention_LAType <- lm(Interventions_A_E~Employees + LAType, data = food_data)


lm_employees_intervention_LAType_emm <- emmeans(lm_employees_intervention_LAType, ~LAType)
lm_employees_intervention_LAType_emm
##  LAType                       emmean    SE  df lower.CL upper.CL
##  District Council               88.4 0.963 340     86.5     90.2
##  London Borough                 82.5 2.172 340     78.2     86.7
##  Metropolitan Borough Council   84.6 2.111 340     80.4     88.7
##  NI Unitary Authority           88.2 3.716 340     80.9     95.5
##  Unitary Authority              83.2 1.659 340     80.0     86.5
##  Welsh Unitary Authority        89.1 2.742 340     83.7     94.5
## 
## Confidence level used: 0.95
# plots for comparison

ggplot(summary(lm_employees_intervention_LAType_emm),
       aes(x = LAType, y = emmean, ymin = lower.CL, ymax = upper.CL, colour = LAType)) +
  geom_point() +
  geom_linerange() +
  labs(x = "LAType", y = "Intervention (A-E)",
       title = "CIs for each LA Type",
       subtitle = "Error Bars are Extent of 95% CIs") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0))

1.1.5 - Determine if there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority.

# Calculating a new column having Employees per establishment 

food_data$Employees_Per_Establishment <- (food_data$Employees/food_data$Total_Establishments)
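Because Total_Establishments appears in the denominator, it is worth confirming that no authority reports zero establishments, which would produce Inf values in the ratio. A quick defensive sketch (not part of the original analysis):

```r
# Guard against division by zero, then inspect the new ratio column
stopifnot(all(food_data$Total_Establishments > 0))
summary(food_data$Employees_Per_Establishment)
```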
# Plot to visualize the correlation

ggplot(food_data, aes(y = Interventions_A_E, x = Employees_Per_Establishment)) +
  geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Employees per Establishment", y = "Interventions A-E",
       title = "Variation in Interventions A-E as a function of Employees per Establishment",
       subtitle = "The shaded area shows the 95% CI for the best-fitting regression line")
## `geom_smooth()` using formula 'y ~ x'

# Checking the correlation coefficient between Interventions_A_E and Employees_Per_Establishment

rcorr(
  food_data$Interventions_A_E, 
  food_data$Employees_Per_Establishment
  )
##      x    y
## x 1.00 0.23
## y 0.23 1.00
## 
## n= 347 
## 
## 
## P
##   x  y 
## x     0
## y  0

As seen from the figure and correlation coefficient above (r = 0.23), there is a positive correlation between Employees per Establishment and Interventions A-E.

#lm model:

lm_intervention_emp_per_est <- lm(Interventions_A_E ~ Employees_Per_Establishment, food_data)
summary(lm_intervention_emp_per_est)
## 
## Call:
## lm(formula = Interventions_A_E ~ Employees_Per_Establishment, 
##     data = food_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.529  -5.642   4.338   8.191  15.913 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   78.943      1.879  42.009  < 2e-16 ***
## Employees_Per_Establishment 2816.377    647.165   4.352 1.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.07 on 345 degrees of freedom
## Multiple R-squared:  0.05204,    Adjusted R-squared:  0.04929 
## F-statistic: 18.94 on 1 and 345 DF,  p-value: 1.781e-05
anova(lm_intervention_emp_per_est)
## Analysis of Variance Table
## 
## Response: Interventions_A_E
##                              Df Sum Sq Mean Sq F value    Pr(>F)    
## Employees_Per_Establishment   1   2761 2760.70  18.939 1.781e-05 ***
## Residuals                   345  50291  145.77                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Section 1.2 - Food Standards Agency Interventions - Professional Report

This section of the report aims to provide the conclusions of the technical development and answer the requirements of the politicians and managers of the food agency.

Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels combined

The figure below shows the distribution of the percentage of enforcement actions successfully achieved, for all impact levels combined, for each LA type. As evident from the distribution, each LA type peaks in the 85-100% range.

Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels separately - A to E

The figure below shows the distribution of the percentage of enforcement actions successfully achieved at each impact level separately (A to E). At every impact level the distribution is concentrated towards the right-hand side (i.e. left-skewed, with most mass near 100%). The distribution for Intervention A is the most concentrated, with most values in the 95-100% range, while the distribution for Intervention E is the least concentrated of all.

Determine if employing more employees increases the likelihood of establishments successfully responding to enforcement actions

In order to determine whether employing more employees increases the likelihood of establishments successfully responding to enforcement actions, we fitted a linear model. As seen from the figure below, there is only a weak correlation between employees and successful responses to enforcement actions. The linear model shows that Employees is not a significant predictor of successful response to enforcement actions, t(345) = -0.447, p = 0.655.

Therefore, it is unlikely that employing more employees increases the likelihood of establishments successfully responding to enforcement actions.


Determine if there is a relationship between proportion of successful responses and the number of FTE food safety employees in each local authority

The figure below shows a weak correlation between Employees and Interventions A to E.


The table below shows the average proportion of successful responses for each LA type, with the associated confidence intervals. It should be noted that the confidence intervals overlap, as displayed in the figure below.

##  LAType                       emmean    SE  df lower.CL upper.CL
##  District Council               88.4 0.963 340     86.5     90.2
##  London Borough                 82.5 2.172 340     78.2     86.7
##  Metropolitan Borough Council   84.6 2.111 340     80.4     88.7
##  NI Unitary Authority           88.2 3.716 340     80.9     95.5
##  Unitary Authority              83.2 1.659 340     80.0     86.5
##  Welsh Unitary Authority        89.1 2.742 340     83.7     94.5
## 
## Confidence level used: 0.95

Determine if there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority

The figure below shows the relationship between Interventions A to E and the number of employees per establishment. There appears to be a positive correlation, which is investigated further via a linear model and ANOVA test.


As shown below, Employees per Establishment is a statistically significant predictor of interventions (p < .001). We therefore conclude that the proportion of successful responses increases as the number of employees per establishment increases.

## Analysis of Variance Table
## 
## Response: Interventions_A_E
##                              Df Sum Sq Mean Sq F value    Pr(>F)    
## Employees_Per_Establishment   1   2761 2760.70  18.939 1.781e-05 ***
## Residuals                   345  50291  145.77                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Section 2.1 - Book Sales - Technical Development


This analysis of the Book Sales data addresses the following questions from the managers of a publishing company:

2.1.1 Do books from different genres have different daily sales on average?

2.1.2 Do books have more/fewer sales depending upon their average review scores and total number of reviews?

2.1.3 What is the effect of sale price upon the number of sales, and is this different across genres?



Data Dictionary for Book Sales Data:

The data dictionary below is based on the description provided for Question 2:

Feature Description
sold.by book seller name
publisher.type type of publisher i.e. amazon, big five, indie, single author, small/medium
genre Genre of the book i.e. childrens, fiction and non-fiction
avg.review average review of the book out of 5
daily.sales average number of sales (minus refunds) across all days in the period
total.reviews total number of reviews for the particular book
sale.price average price for which the book sold across all sales in the period

The column “sold by” was renamed to “sold.by” in the input csv file, to avoid run-time errors in the R environment.
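Editing the source file by hand is easy to forget when the analysis is re-run; the same fix can alternatively be applied in code after import. A sketch, assuming the raw column is named `sold by` as stated above:

```r
# Alternative: rename in R, leaving the .csv untouched
library(readr)
library(dplyr)

sales_data <- read_csv("publisher_sales.csv", show_col_types = FALSE) %>%
  rename(sold.by = `sold by`)
```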

Reading the book sales data into the R environment:

sales_data <- read_csv("publisher_sales.csv", guess_max = 10000, show_col_types = FALSE)

# Note - no. of rows and columns imported are checked against the csv file

Question 2.1.1 - Do books from different genres have different daily sales on average?

Summary statistics:

book_data_summary <- sales_data %>%
  dplyr::group_by(genre) %>%
  dplyr::summarize(mean=mean(daily.sales),median = median(daily.sales),n = n())

book_data_summary
## # A tibble: 3 × 4
##   genre        mean median     n
##   <chr>       <dbl>  <dbl> <int>
## 1 childrens    55.6   55.4  2000
## 2 fiction     106.   106.   2000
## 3 non_fiction  75.9   75.9  2000

Visualising daily.sales by genre:

ggplot(sales_data, aes(x=genre, y = daily.sales))+
  geom_boxplot()+
  labs(title="Daily Sales by Genre", x="Genre", y="Daily Sales")

The following modifications were made to the imported data set to make it suitable for the analysis:

  1. The entry in the non_fiction genre with a daily sales value of -0.53 was updated to 0.

  2. All character columns were converted to factors.

# Updating daily sales value

sales_data <- sales_data %>%
  mutate(daily.sales = ifelse(daily.sales<0,0,daily.sales))

# Updating the data type

# generating a vector to keep the column names
columns <- c("sold.by", "publisher.type", "genre")

# Set the correct measurement levels or data types
sales_data[columns] <- lapply(sales_data[columns], as.factor)
# Checking if the correct data type corrections are made or not

str(pull(sales_data,sold.by))
##  Factor w/ 13 levels "Amazon Digital Services,  Inc.",..: 11 1 1 1 13 13 1 6 1 1 ...
str(pull(sales_data,publisher.type))
##  Factor w/ 5 levels "amazon","big five",..: 2 3 5 5 2 2 5 2 5 5 ...
str(pull(sales_data,genre))
##  Factor w/ 3 levels "childrens","fiction",..: 1 3 3 2 1 1 2 1 2 1 ...
# all character data types are converted to factor

Creating the data set for requirement 2.1.1

# Filtering out genre and daily.sales

daily_sales_by_genre <- sales_data %>%
  select(genre,daily.sales)
# lm model for predicting daily sales based on genre

lm_daily_sales_genre <- lm(daily.sales~genre, data=daily_sales_by_genre)

Using the anova() function to test whether genre has a significant effect on daily sales:

anova(lm_daily_sales_genre)
## Analysis of Variance Table
## 
## Response: daily.sales
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## genre        2 2562524 1281262  2590.6 < 2.2e-16 ***
## Residuals 5997 2966052     495                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Using the NHST approach

summary(lm_daily_sales_genre)
## 
## Call:
## lm(formula = daily.sales ~ genre, data = daily_sales_by_genre)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.396  -13.326   -0.076   13.249  102.094 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       55.5773     0.4973  111.76   <2e-16 ***
## genrefiction      50.3087     0.7033   71.53   <2e-16 ***
## genrenon_fiction  20.2888     0.7033   28.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.24 on 5997 degrees of freedom
## Multiple R-squared:  0.4635, Adjusted R-squared:  0.4633 
## F-statistic:  2591 on 2 and 5997 DF,  p-value: < 2.2e-16

Estimated marginal means:

lm_daily_sales_genre_emm <- emmeans(lm_daily_sales_genre, ~genre)
lm_daily_sales_genre_emm
##  genre       emmean    SE   df lower.CL upper.CL
##  childrens     55.6 0.497 5997     54.6     56.6
##  fiction      105.9 0.497 5997    104.9    106.9
##  non_fiction   75.9 0.497 5997     74.9     76.8
## 
## Confidence level used: 0.95

The summary() output above does not directly give the difference between the fiction and non_fiction genres. Therefore:

  1. the pairs function is used to test whether the differences among the genres are significant;

  2. the contrast is combined with confint to calculate the mean differences and their associated confidence intervals.

lm_daily_sales_genre_pairs <- pairs(lm_daily_sales_genre_emm)
lm_daily_sales_genre_pairs
##  contrast                estimate    SE   df t.ratio p.value
##  childrens - fiction        -50.3 0.703 5997 -71.535  <.0001
##  childrens - non_fiction    -20.3 0.703 5997 -28.849  <.0001
##  fiction - non_fiction       30.0 0.703 5997  42.686  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

As evident from the above calculations based on the Tukey test, the differences between each pair of genres are significant (p < .0001).

lm_daily_sales_genre_contrast <- confint(pairs(lm_daily_sales_genre_emm))
lm_daily_sales_genre_contrast
##  contrast                estimate    SE   df lower.CL upper.CL
##  childrens - fiction        -50.3 0.703 5997    -52.0    -48.7
##  childrens - non_fiction    -20.3 0.703 5997    -21.9    -18.6
##  fiction - non_fiction       30.0 0.703 5997     28.4     31.7
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
# plots for comparison

p.gain <- ggplot(summary(lm_daily_sales_genre_emm),
                 aes(x = genre, y = emmean, ymin = lower.CL, ymax = upper.CL)) +
  geom_point() +
  geom_linerange() +
  labs(x = "Genre", y = "Average Daily Sales",
       title = "Distribution across Genre",
       subtitle = "Error Bars are Extent of 95% CIs")

p.contrasts <- ggplot(lm_daily_sales_genre_contrast,
                      aes(x = contrast, y = estimate, ymin = lower.CL, ymax = upper.CL)) +
  geom_point() +
  geom_linerange() +
  labs(x = "Genre - Contrast", y = "Average Daily Sales",
       title = "Distribution - Genre Contrast",
       subtitle = "Error Bars are Extent of 95% CIs") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

grid.arrange(p.gain, p.contrasts, ncol=2)

Visualising the distribution of daily sales across genres under the null-hypothesis and alternative-hypothesis models

Step 1 - Arguments for the null-hypothesis and alternative-hypothesis models

# Arguments for Null Hypothesis Model

m.daily.sales.intercept <- lm(daily.sales~1, data=daily_sales_by_genre)

null.mean <- coef(m.daily.sales.intercept)
null.sd <- sigma(m.daily.sales.intercept)

# Arguments for Alternative Hypothesis Model - Running lm model and calculating emmeans

alternative.means <- summary(lm_daily_sales_genre_emm)$emmean
alternative.sd <- sigma(lm_daily_sales_genre)

Step 2 - Visually comparing the null-hypothesis and alternative-hypothesis models.

# Checking the Null Hypothesis Distribution:

null_plot <- ggplot(daily_sales_by_genre,aes(x=daily.sales, fill = genre))+
  geom_histogram(aes(y=..density..),position="identity", alpha=0.3, binwidth=1)+
  stat_function(fun=dnorm, args=list(mean=null.mean, sd=null.sd)) + labs(x="Daily Sales", y="Density", title="Null Hypothesis", fill="Genre")

# Checking the Alternative Hypothesis Distribution:

colours <- scales::hue_pal()(3)
alt_plot <- ggplot(daily_sales_by_genre, aes(x=daily.sales, fill=genre))+
  geom_histogram(aes(y=..density..), position="identity", alpha=0.3, binwidth=1)+ 
  labs(x="Daily Sales", y="Density", title="Alternative Hypothesis", fill="Genre") +
    stat_function(fun=dnorm, args=list(mean=alternative.means[1], sd=alternative.sd), col=colours[1]) +
    stat_function(fun=dnorm, args=list(mean=alternative.means[2], sd=alternative.sd), col=colours[2]) +
    stat_function(fun=dnorm, args=list(mean=alternative.means[3], sd=alternative.sd), col=colours[3]) 

grid.arrange(null_plot, alt_plot, ncol=1)

The plots above confirm our finding that the three genres differ in their means; the alternative-hypothesis model describes the observed distribution far better than the null model.
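The visual comparison can be backed by a formal nested-model F test, comparing the intercept-only (null) model against the genre model fitted earlier:

```r
# Nested-model comparison: does adding genre significantly improve fit?
anova(m.daily.sales.intercept, lm_daily_sales_genre)
```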

Question 2.1.2 - Do books have more/fewer sales depending upon their average review scores and total number of reviews?

This question calls for multiple linear regression. We first get an overview of the underlying data through the following steps:

  1. Evaluate the correlation matrix

  2. Individual effect of average review scores on daily sales

  3. Individual effect of total reviews on daily sales

  4. Multiple regression - determining main effects and interactions effects

Checking the data distribution of daily.sales with respect to total reviews and average reviews

sales_tot_rev_plot <- ggplot(sales_data, aes(y = daily.sales, x = total.reviews)) +
  geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Total Reviews", y = "Daily Sales",
       title = "Daily Sales vs Total Reviews",
       subtitle = "The shaded area shows the 95% CI for the best-fitting regression line")

sales_avg_rev_plot <- ggplot(sales_data, aes(y = daily.sales, x = avg.review)) +
  geom_point() +
  geom_smooth(method = lm) +
  labs(x = "Average Reviews", y = "Daily Sales",
       title = "Daily Sales vs Average Reviews",
       subtitle = "The shaded area shows the 95% CI for the best-fitting regression line")

sales_dist <- grid.arrange(sales_tot_rev_plot, sales_avg_rev_plot, ncol=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

As seen in the figure above, daily sales are positively correlated with total reviews and show essentially no correlation with average reviews.

Creating a data frame for daily sales, average reviews, and total reviews

# Selecting sales and avg review and total reviews columns

sales_rev_df <- sales_data %>%
  select(daily.sales,avg.review,total.reviews)

Checking the correlation matrix to see the correlation between each pair of variables and a p-value:

sales_corr_matrix <- rcorr(as.matrix(sales_rev_df))
sales_corr_matrix
##               daily.sales avg.review total.reviews
## daily.sales          1.00        0.0          0.66
## avg.review           0.00        1.0          0.10
## total.reviews        0.66        0.1          1.00
## 
## n= 6000 
## 
## 
## P
##               daily.sales avg.review total.reviews
## daily.sales               0.747      0.000        
## avg.review    0.747                  0.000        
## total.reviews 0.000       0.000

Determining the individual effect of Average Reviews and Total Reviews on daily sales:

First, the effect of average review scores on daily sales via the lm() function:

# Linear model for daily sales vs avg reviews

lm_sales_avg_rev <- lm(daily.sales~avg.review, data = sales_rev_df)
summary(lm_sales_avg_rev)
## 
## Call:
## lm(formula = daily.sales ~ avg.review, data = sales_rev_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.414 -22.299  -4.837  18.943 128.948 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.0533     2.9509  27.128   <2e-16 ***
## avg.review   -0.2211     0.6854  -0.323    0.747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.36 on 5998 degrees of freedom
## Multiple R-squared:  1.735e-05,  Adjusted R-squared:  -0.0001494 
## F-statistic: 0.1041 on 1 and 5998 DF,  p-value: 0.747
# 95% CI for average reviews

cbind(coefficient=coef(lm_sales_avg_rev), confint(lm_sales_avg_rev))
##             coefficient     2.5 %    97.5 %
## (Intercept)  80.0533475 74.268426 85.838269
## avg.review   -0.2211249 -1.564779  1.122529

Second, the effect of total reviews on daily sales via the lm() function:

# Linear model for daily sales vs total reviews

lm_sales_tot_rev <- lm(daily.sales~total.reviews, data = sales_rev_df)
summary(lm_sales_tot_rev)
## 
## Call:
## lm(formula = daily.sales ~ total.reviews, data = sales_rev_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.203  -14.824   -1.026   13.620  138.424 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.875791   1.077625   7.308 3.05e-13 ***
## total.reviews 0.537047   0.007818  68.695  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.71 on 5998 degrees of freedom
## Multiple R-squared:  0.4403, Adjusted R-squared:  0.4402 
## F-statistic:  4719 on 1 and 5998 DF,  p-value: < 2.2e-16
# 95% CI for total reviews

cbind(coefficient=coef(lm_sales_tot_rev), confint(lm_sales_tot_rev))
##               coefficient     2.5 %    97.5 %
## (Intercept)     7.8757911 5.7632580 9.9883242
## total.reviews   0.5370474 0.5217215 0.5523733

Multiple Regression - checking the combined effect of average review scores and total reviews on daily sales:

# lm model to predict the daily sales as a function of average and total reviews

lm_sales_avg_tot_rev <- lm(daily.sales ~ avg.review + total.reviews, data = sales_rev_df)

summary(lm_sales_avg_tot_rev)
## 
## Call:
## lm(formula = daily.sales ~ avg.review + total.reviews, data = sales_rev_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.396  -14.645   -1.059   13.690  122.428 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.872183   2.341239  10.196  < 2e-16 ***
## avg.review    -3.943920   0.513113  -7.686 1.76e-14 ***
## total.reviews  0.543329   0.007823  69.452  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.6 on 5997 degrees of freedom
## Multiple R-squared:  0.4458, Adjusted R-squared:  0.4456 
## F-statistic:  2412 on 2 and 5997 DF,  p-value: < 2.2e-16

The results above show that daily sales decrease by a statistically significant amount of £3.94, t(5997) = -7.7, p < .0001, for every one-point increase in average review score, holding total reviews constant. Likewise, controlling for average reviews, daily sales increase by a statistically significant amount of £0.54, t(5997) = 69.5, p < .0001, for every additional review.

# Generating the confidence intervals for the estimations approach

cbind(coefficient=coef(lm_sales_avg_tot_rev), confint(lm_sales_avg_tot_rev))
##               coefficient      2.5 %     97.5 %
## (Intercept)     23.872183 19.2825123 28.4618540
## avg.review      -3.943920 -4.9498061 -2.9380331
## total.reviews    0.543329  0.5279928  0.5586651

The 95% CIs above for avg.review and total.reviews are narrow and do not include 0; therefore, both coefficients are significantly different from zero.
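To illustrate how the fitted model can be applied, the sketch below predicts daily sales, with a 95% confidence interval, for a hypothetical book (the review values are illustrative assumptions, not from the data):

```r
# Hypothetical book: average review of 4.0 and 120 total reviews
new_book <- data.frame(avg.review = 4.0, total.reviews = 120)

# Point prediction plus a 95% confidence interval for the mean response
predict(lm_sales_avg_tot_rev, newdata = new_book, interval = "confidence")
```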

Visualizing the surface plots for the main effect of avg.reviews and total.reviews on daily.sales:

sales_preds <- as_tibble(expand.grid(avg.review = seq(0, 5, 0.5),
                                     total.reviews = seq(0, 250, 5)))

sales_preds <- mutate(sales_preds,
                         sales_hat = predict(lm_sales_avg_tot_rev, sales_preds))

ggplot(sales_preds, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = sales_hat)) + guides(fill=guide_legend(title="Daily Sales")) + labs(x = "Average Review", y = "Total Reviews", title = "Variation in Daily Sales wrt. Total and Average Reviews")

Checking whether there is an interaction between total reviews and average reviews by running an lm() model that includes the interaction effect in addition to the main effects.

sales_intr <- lm(daily.sales ~ avg.review*total.reviews, data = sales_rev_df)

summary(sales_intr)
## 
## Call:
## lm(formula = daily.sales ~ avg.review * total.reviews, data = sales_rev_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104.08  -14.63   -0.92   13.82   92.33 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               63.547788   4.177990  15.210  < 2e-16 ***
## avg.review               -13.683943   0.993145 -13.778  < 2e-16 ***
## total.reviews              0.164761   0.034067   4.836 1.36e-06 ***
## avg.review:total.reviews   0.091687   0.008035  11.411  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.36 on 5996 degrees of freedom
## Multiple R-squared:  0.4576, Adjusted R-squared:  0.4573 
## F-statistic:  1686 on 3 and 5996 DF,  p-value: < 2.2e-16

As seen from the above results, there is a statistically significant positive interaction between average reviews and total reviews when predicting the daily sales.

Checking for multicollinearity between the predictors using the variance inflation factor (VIF):

vif(lm_sales_avg_tot_rev)
##    avg.review total.reviews 
##      1.011033      1.011033

Since the VIF scores are well below 5, there is no problematic collinearity, and it is valid to include both average and total reviews as predictors of daily sales.
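Note that these VIFs are for the main-effects model; in the interaction model fitted above, the product term is inevitably correlated with its components. A common remedy, sketched here as an optional check (not part of the original analysis), is to mean-centre the predictors before forming the interaction:

```r
# Mean-centre the predictors so the interaction term is less collinear
# with the main effects; slopes then describe effects at average values
sales_rev_centred <- sales_rev_df %>%
  mutate(avg.review.c    = avg.review - mean(avg.review),
         total.reviews.c = total.reviews - mean(total.reviews))

sales_intr_c <- lm(daily.sales ~ avg.review.c * total.reviews.c,
                   data = sales_rev_centred)
vif(sales_intr_c)
```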

Also, let's compare the two models via an anova() test to check whether the additional complexity of the interaction between total reviews and average reviews is justified:

anova(lm_sales_avg_tot_rev, sales_intr)
## Analysis of Variance Table
## 
## Model 1: daily.sales ~ avg.review + total.reviews
## Model 2: daily.sales ~ avg.review * total.reviews
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   5997 3064016                                  
## 2   5996 2998894  1     65122 130.21 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
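The comparison shows that the interaction model fits significantly better, F(1, 5996) = 130.21, p < .0001, so the extra term is justified. As an optional cross-check (not part of the original analysis), information criteria can be compared; lower values indicate the preferred model:

```r
# AIC and BIC both penalise model complexity; lower is better
AIC(lm_sales_avg_tot_rev, sales_intr)
BIC(lm_sales_avg_tot_rev, sales_intr)
```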

Visualizing the main effect and interaction effect graphically:

intr_surf_data <- as_tibble(expand.grid(avg.review = seq(0, 5, 0.5),
                                        total.reviews = seq(0, 250, 5)))

intr_surf_data <- mutate(intr_surf_data,
                         main.hat = predict(lm_sales_avg_tot_rev, intr_surf_data),
                         intr.hat = predict(sales_intr, intr_surf_data))

surf.main <- ggplot(intr_surf_data, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = main.hat)) + guides(fill=guide_legend(title="Daily Sales"))+ labs(x = "Average Review", y = "Total Reviews", title = "Variation in Daily Sales wrt. Total and Average Reviews",subtitle = "Main Effects")


surf.intr <- ggplot(intr_surf_data, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = intr.hat))  + guides(fill=guide_legend(title="Daily Sales"))+ labs(x = "Average Review", y = "Total Reviews", title = "Variation in Daily Sales wrt. Total and Average Reviews", subtitle = "Interaction Effects")

sales_rev_main_int_plot <- grid.arrange(surf.main, surf.intr, nrow = 2)

Question 2.1.3 - What is the effect of sale price upon the number of sales, and is this different across genres?

Creating the data frame by selecting sale.price, daily.sales and genre

sale_price_df <- sales_data %>%
  select(sale.price,daily.sales,genre)

Checking the distribution for sales vs sale price across genre

sp_sales_all <- ggplot(sale_price_df, aes(y=daily.sales, x=sale.price)) + geom_point(alpha = 0.15) + labs(x="Sales Price", y="Daily Sales", subtitle="Effect of Sales Price on Daily Sales for all Genre") + geom_smooth(method=lm)

sp_sales_genre <- ggplot(sale_price_df, aes(y=daily.sales, x=sale.price, color = genre)) + geom_point(alpha = 0.15) + labs(x="Sales Price", y="Daily Sales", subtitle="Effect of Sales Price on Daily Sales by Genre") + geom_smooth(method=lm)

sale_price_dist <- grid.arrange(sp_sales_all, sp_sales_genre, ncol=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

As seen from the figure above, Sale Price is negatively correlated with Daily Sales.

As shown in the below correlation matrix, daily sales and sales price are negatively correlated.

cor(select(sale_price_df,daily.sales,sale.price))
##             daily.sales sale.price
## daily.sales   1.0000000 -0.2776016
## sale.price   -0.2776016  1.0000000

Utilising the lm() function to determine the effect of sales price on daily sales

lm_sales_sp <- lm(daily.sales~sale.price, data = sale_price_df)

summary(lm_sales_sp)
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = sale_price_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.769 -20.650  -4.634  17.099 130.315 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 112.0818     1.5207   73.70   <2e-16 ***
## sale.price   -3.8156     0.1705  -22.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.17 on 5998 degrees of freedom
## Multiple R-squared:  0.07706,    Adjusted R-squared:  0.07691 
## F-statistic: 500.8 on 1 and 5998 DF,  p-value: < 2.2e-16
# Confidence Intervals:
cbind(coefficient=coef(lm_sales_sp), confint(lm_sales_sp))
##             coefficient      2.5 %     97.5 %
## (Intercept)  112.081751 109.100621 115.062881
## sale.price    -3.815582  -4.149821  -3.481342
# emmeans: estimated marginal mean of daily sales at the mean sale price (£8.64)

(lm_sales_sp_emm <- emmeans(lm_sales_sp, ~sale.price))
##  sale.price emmean    SE   df lower.CL upper.CL
##        8.64   79.1 0.377 5998     78.4     79.8
## 
## Confidence level used: 0.95

The results above show a significant relationship between daily sales and sale price: an average decrease of about 4 books per day for every £1 increase in sale price (95% CI [-4.15, -3.48]). This decrease is significantly different from zero, t(5998) = -22.38, p < .0001.
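As an illustration, the sketch below uses the fitted model to predict expected daily sales at two hypothetical price points (the prices are illustrative assumptions, not from the data):

```r
# Expected daily sales, with 95% confidence intervals, at £5 and £15
price_points <- data.frame(sale.price = c(5, 15))
predict(lm_sales_sp, newdata = price_points, interval = "confidence")
```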

Determining this effect across the genre - including genre as well in the predictors:

lm_sales_sp_genre <- lm(daily.sales~sale.price + genre, data = sale_price_df)

(  lm_sales_sp_genre_emm <- emmeans(lm_sales_sp_genre, ~genre)  )
##  genre       emmean    SE   df lower.CL upper.CL
##  childrens     56.7 0.532 5996     55.7     57.7
##  fiction      105.4 0.504 5996    104.4    106.4
##  non_fiction   75.3 0.507 5996     74.3     76.3
## 
## Confidence level used: 0.95

Using the anova() test to compare the two models above, i.e. without and with controlling for genre:

anova(lm_sales_sp, lm_sales_sp_genre)
## Analysis of Variance Table
## 
## Model 1: daily.sales ~ sale.price
## Model 2: daily.sales ~ sale.price + genre
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   5998 5102529                                  
## 2   5996 2949564  2   2152966 2188.3 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Therefore, the model with sale price and genre as predictors is significantly better than the model with sale price only, F(2, 5996) = 2188.3, p < .0001.

Checking if daily sales differ across the genres:

lm_daily_sales_sp_genre_pairs <- pairs(lm_sales_sp_genre_emm)
lm_daily_sales_sp_genre_pairs
##  contrast                estimate    SE   df t.ratio p.value
##  childrens - fiction        -48.7 0.756 5996 -64.360  <.0001
##  childrens - non_fiction    -18.6 0.762 5996 -24.344  <.0001
##  fiction - non_fiction       30.1 0.702 5996  42.922  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
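The models above let genre shift the intercept only. To test directly whether the effect of price itself differs across genres, a price-by-genre interaction model could be fitted; the sketch below (not run in the original analysis) shows one way to do this with the packages already in use:

```r
# Allow the slope of sale.price to differ by genre
lm_sales_sp_x_genre <- lm(daily.sales ~ sale.price * genre,
                          data = sale_price_df)

# Does allowing genre-specific price slopes significantly improve the fit?
anova(lm_sales_sp_genre, lm_sales_sp_x_genre)

# Estimated price slope within each genre (via emmeans::emtrends)
emtrends(lm_sales_sp_x_genre, ~genre, var = "sale.price")
```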

Section 2.2 - Book Sales - Professional Report

This section of the report aims to provide the conclusions of the technical development and answer the requirements of the managers of the publishing company.

The table below (Table 2.2.1) shows the high-level summary statistics of the book sales data provided to us. The data are evenly distributed in terms of the number of records per genre, and the close proximity of the mean and median indicates a fairly symmetric distribution.

Table 2.2.1 Summary Statistics for Daily Sales across Genre
Genre Mean Median Records
childrens 55.57726 55.390 2000
fiction 105.88591 106.005 2000
non_fiction 75.86584 75.920 2000

As the first step of the analysis, the figure below helps us visualise Daily Sales across genres. As evident from the box plot, there is one record in the non_fiction genre with a negative daily sales value; it makes more business sense to update this value to 0 for further analysis.

Do books from different genres have different daily sales on average?

To answer this question, we first carried out an ANOVA test; the results are summarised below:

Daily sales differ significantly across genres, F(2, 5997) = 2591, p < .0001. Similarly, NHST via the summary() function points to significant differences, taking the childrens genre as the reference level.

In order to know the individual average sales, we have utilised the estimated marginal means approach. In addition to the mean for each genre, we now also have the associated confidence intervals, as shown below:

##  genre       emmean    SE   df lower.CL upper.CL
##  childrens     55.6 0.497 5997     54.6     56.6
##  fiction      105.9 0.497 5997    104.9    106.9
##  non_fiction   75.9 0.497 5997     74.9     76.8
## 
## Confidence level used: 0.95

The left panel of the figure below shows the mean daily sales for each genre: fiction has the highest sales, followed by non_fiction, with the childrens genre lowest. The right panel shows the estimated difference in daily sales for each pair of genres. For instance, the fiction versus non_fiction comparison shows that fiction has, on average, about 30 more daily sales than non_fiction, 95% CI [28.4, 31.7].

The plot below shows the two competing hypotheses for this question. Based on our analysis, the data are consistent with the alternative hypothesis, and we can reject the null hypothesis.

Do books have more/fewer sales depending upon their average review scores and total number of reviews?

To answer this question, we started by looking at the data distribution. As seen in the figure below, daily sales and total reviews are positively correlated (r = 0.66, p < .0001), while daily sales and average reviews are essentially uncorrelated (r ≈ 0.00, p = 0.747).

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Effect of Total Reviews on Daily Sales:

Our analysis shows a significant relationship between daily sales and total reviews: an average increase of £0.53 in sales for every additional review (95% CI [0.52, 0.55]). This increase is significantly different from zero, t(5998) = 68.7, p < .0001.

Effect of Average Reviews on Daily Sales:

Our analysis shows that the relationship between daily sales and average reviews is not significantly different from zero. There is an average decrease of £0.22 in sales for every one-point increase in average review, but the confidence interval includes zero (95% CI [-1.6, 1.1]) and the decrease is not significant, t(5998) = -0.32, p = 0.747.

Effect of Total Reviews and Average Reviews on Daily Sales:

Our analysis shows that daily sales decrease by a statistically significant amount of £3.94, t(5997) = -7.7, p < .0001, for every one-point increase in average review score, holding total reviews constant. Likewise, controlling for average reviews, daily sales increase by a statistically significant amount of £0.54, t(5997) = 69.5, p < .0001, for every additional review.

Also, our analysis shows a statistically significant positive interaction between average reviews and total reviews when predicting daily sales. We used an ANOVA test to check whether the additional complexity of the interaction between total reviews and average reviews is justified.

Model comparison shows that a regression model including total reviews, average reviews, and their interaction gives a significantly better overall fit than a model including only the two main effects, F(1, 5996) = 130.21, p < .0001.

The visualisation below summarises the effect of average and total reviews on daily sales: the higher a book's average review, the steeper the increase in daily sales as the total number of reviews grows. For instance, at an average review of 4.5, the predicted sales span the entire range (from the dark blue surface to the yellow surface) as total reviews increase.

What is the effect of sale price upon the number of sales, and is this different across genres?

As seen from the distribution of daily sales vs sale price (£) across genres below, sale price is negatively correlated with daily sales.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Our analysis shows a significant relationship between daily sales and sale price: an average decrease of about four books per day for every £1 increase in sale price (95% CI [-4.15, -3.48]). This decrease is significantly different from zero, t(5998) = -22.38, p < .0001.

We used an ANOVA test to compare the two possible models, i.e. without and with controlling for genre. The model with sale price and genre as predictors is significantly better than the model with sale price only. Based on the pairwise comparisons shown below, we can conclude that daily sales differ across the genres.

##  contrast                estimate    SE   df t.ratio p.value
##  childrens - fiction        -48.7 0.756 5996 -64.360  <.0001
##  childrens - non_fiction    -18.6 0.762 5996 -24.344  <.0001
##  fiction - non_fiction       30.1 0.702 5996  42.922  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates